Four indices for investigating inter-observer
accuracy in observational instruments (contingency coefficient,
Scott's pi, Bernstein's coefficient, and percent agreement) are
reviewed concerning their assumptions, formulation, and tables
indicating numerical functioning. Three of the four indices
(excluding the contingency coefficient) are compared by computing
each for four sets of observational data. It was found that
Bernstein's coefficient had the highest median and the smallest
range, percent agreement the second highest median and the second
smallest range, and Scott's pi the lowest median and the largest
range. It is hoped that authors will employ this information in their
practical application and interpretation of these indices.
(Author)
Review of Indices

Four indices for investigating inter-observer accuracy (agreement
between a criterion observer and another observer or two observers) of
observational instruments have been selected. The contingency coefficient
(C) is often used for determining the relationship between two nominal
variables and is based on the results of a chi-square test of independence.
The second coefficient is percent of inter-observer agreement (P); this
coefficient is an input for the following two coefficients. The third,
Bernstein's (1968) coefficient (Pb), has been chosen as the assumptions
provided in its derivation are generally applicable for examining
inter-observer accuracy of instruments. And the fourth, Scott's (1955) pi, was
chosen because of its traditional usage as an accuracy index for
observational data.



The formulation for each of the coefficients is presented in Table 1.
The calculation of the contingency coefficient is based on the results of
a chi-square test of independence (see Table 1). The two assumptions for
chi-square, and thus the contingency coefficient, are that chi-square be used
with nominal or classification data and that the categories for chi-square be
mutually exclusive.
Assuming the use of nominal data with expected cell entries greater than
five, the contingency coefficient is also restricted by the size of the
array (number of rows and/or columns). The computation of Cmax and
subsequent comparison of C to Cmax (C/Cmax) gives a corrected estimate of the
relationship of the two classification variables based on the size of the
array. Garrett (1967, p. 395) provides information as to the relationship
of the contingency coefficient and the product-moment correlation
coefficient. The value of C ranges from 0 to 1.00.
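
The comparison of C to Cmax described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the usual textbook definitions C = sqrt(chi-square / (N + chi-square)) and Cmax = sqrt((k - 1) / k), the function name is hypothetical, and the 2 x 2 table is made-up data rather than anything from the study.

```python
import math

def contingency_coefficient(table):
    """Return (C, C_max) for a table of observed counts.

    Assumes every row and column total is non-zero, so that no
    expected frequency is zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # k is the number of rows or columns; min() is an assumption for
    # non-square tables.
    k = min(len(row_totals), len(col_totals))
    c = math.sqrt(chi2 / (n + chi2))
    c_max = math.sqrt((k - 1) / k)
    return c, c_max

# Made-up 2 x 2 table: two observers in perfect agreement on 20 recordings.
c, c_max = contingency_coefficient([[10, 0], [0, 10]])
corrected = c / c_max
```

In this hypothetical case C itself is only about .71, yet it equals Cmax for a 2 x 2 array, so the corrected ratio C/Cmax is 1.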



Insert Table 1 about here 



Percent agreement assumes the use of mutually exclusive categories, 
as does the contingency coefficient; the calculation for percent agreement 
is presented in Table 1. Two sets of criteria for determining percent of 
agreement between two observers were employed: 1. Same category at same
time (C); 2. Same category at same time in same who-to-whom column (E).
The range of values for percent agreement is from 0 to 100%.
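
The two agreement criteria can be illustrated with a short Python sketch. It assumes each recording is stored as a (category, column) pair, one per time interval, and that the total number of possible agreements equals the number of intervals; the function name, category codes, and column labels are hypothetical.

```python
def percent_agreement(obs_x, obs_y, exact=False):
    """Percent of time intervals on which two observers agree.

    exact=False: criterion C (same category at same time).
    exact=True:  criterion E (same category at same time in the same
                 who-to-whom column).
    """
    if len(obs_x) != len(obs_y) or not obs_x:
        raise ValueError("need two non-empty records of equal length")
    agreements = 0
    for (cat_x, col_x), (cat_y, col_y) in zip(obs_x, obs_y):
        if cat_x == cat_y and (not exact or col_x == col_y):
            agreements += 1
    return 100.0 * agreements / len(obs_x)

# Made-up recordings for four time intervals.
x = [("V", "child-teacher"), ("W", "child-child"),
     ("F", "child-teacher"), ("O", "child-teacher")]
y = [("V", "child-teacher"), ("W", "child-teacher"),
     ("F", "child-teacher"), ("A", "child-teacher")]

p_c = percent_agreement(x, y)              # 75.0
p_e = percent_agreement(x, y, exact=True)  # 50.0
```

Because criterion E adds a condition to criterion C, the E value can never exceed the C value for the same pair of records.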

An abbreviated form of the derivation of Bernstein's (1968)
coefficient is presented in Table 2. The range of possible values for Bernstein's
coefficient is presented in Table 3 for .05 intervals of P beginning at
P = .51 (Bernstein assumes that P is no less than .51; thus any value of
P less than .51 results in Pb equal to .00).
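
As a rough illustration, Bernstein's coefficient can be computed from a proportion of agreement as follows. The sketch assumes the K = 1 form Pb = (1 + sqrt(2A - 1)) / 2 and the .51 floor described above; the function name is hypothetical.

```python
import math

def bernstein_pb(p):
    """Bernstein's coefficient for a proportion of agreement p (0 to 1).

    Values of p below .51 are mapped to .00, following the stated
    assumption that agreement is no less than .51.
    """
    if p < 0.51:
        return 0.0
    return (1 + math.sqrt(2 * p - 1)) / 2

# Print a few values at selected levels of agreement.
for p in (1.00, 0.90, 0.80, 0.70, 0.60, 0.51):
    print(f"{p:.2f} -> {bernstein_pb(p):.2f}")
```

Note that the coefficient is always at least as large as the proportion of agreement it is computed from whenever that proportion is .51 or more.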



Insert Tables 2 & 3 about here



The calculation of Scott's (1955) pi, often cited in the literature
as an observational instrument inter-observer accuracy coefficient, is
presented in Table 1. Pi is dependent on the number of categories employed.
TABLE 1
Formulation of Coefficients

1. Contingency Coefficient:

   $C = \sqrt{\dfrac{\chi^2}{N + \chi^2}}$, where $\chi^2 = \sum \dfrac{(O - E)^2}{E}$

      O = Observed Frequencies
      E = Expected Frequencies
      N = Total Number of Observations

   $C_{max} = \sqrt{\dfrac{k - 1}{k}}$, where k = # of arrays, either
   columns or rows

2. Percent of Agreement:

   $P = \dfrac{\text{Number of Agreements}}{\text{Total Number of Possible Agreements}}$

      $P_E$ = Exact percent of agreement, i.e.,
      two observers recording the same
      category at same time in same who-to-whom
      column is the criterion for number
      of agreements.

      $P_C$ = Column-time percent of agreement,
      i.e., two observers recording the same
      category at same time is the criterion
      for number of agreements.

3. Bernstein's $P_b$:

   $P_b = \dfrac{1 + \sqrt{2A - 1}}{2}$

      A = $P_E$ or $P_C$ as defined above.

4. Scott's pi:

   $\pi = \dfrac{P - P_e}{1 - P_e}$

      P = $P_E$ or $P_C$ as defined above.

      $P_e = \sum_{i=1}^{k} P_i^2$, where k is the total number
      of categories used and $P_i$
      is the proportion of the entire
      sample which falls in the i-th
      category.
TABLE 2

Derivation of Bernstein's Coefficient*

Definitions:

   Px = probability that coder X will correctly code a given item
   Py = probability that coder Y will correctly code a given item
   A  = Ratio (percent) of agreement in the set of paired codes
        derived from matching the codings.

   Now: Qx = 1 - Px
        Qy = 1 - Py

Assumptions:

   1. Px and Py are constant and independent.
   2. The number of categories is constant.

The probabilities associated with the possible outcomes
for X and Y are given by: (Px + Qx)(Py + Qy) = PxPy + QxPy + QyPx + QxQy

or

   Outcomes                  Prob.   Nature of Agreement and Disagreement

   X and Y correct           PxPy    X and Y agree
   X correct, Y incorrect    PxQy    X and Y disagree
   X incorrect, Y correct    QxPy    X and Y disagree
   X and Y incorrect         QxQy    X and Y agree on the same
                                     incorrect code or X and Y
                                     disagree, but both select
                                     an incorrect code

We can now see that: A = PxPy + QxQyK

where K = fraction of events in the set associated with the
probability QxQy, for which X and Y have selected the same
incorrect code. K can be estimated by a variety of
assumptions.

Now: A = PxPy + QxQyK can be written as

   A = PxPy + (1 - Px)(1 - Py)K

If the two coders X and Y are properly trained, it is reasonable to
assume that Px = Py = P, so that

   $A = P^2 + (1 - P)^2 K$   or   $P^2 + K - 2PK + P^2 K = A$

Solution of this quadratic gives

   $P = \dfrac{K \pm \sqrt{A(1 + K) - K}}{1 + K}$

The restriction of $A \ge 1/2$ and $P > 1/2$ seems reasonable in
situations in which percents of agreement are employed (for example,
the investigation of inter-observer accuracy of observational
instruments). With these restrictions and with $0 \le K \le 1$, the smaller
quadratic root

   $P = \dfrac{K - \sqrt{A(1 + K) - K}}{1 + K}$

is excluded since the largest possible $P = \dfrac{K}{1 + K}$, which is
less than or equal to 1/2, is attained only when

   $A = \dfrac{K}{1 + K} \le \dfrac{1}{2}$

Using the larger quadratic root

   $P = \dfrac{K + \sqrt{A(1 + K) - K}}{1 + K}$

the values of P can be calculated.

The extreme values of K = 1 and K = 0 lead to slightly different
values from each other until A is as low as .70.

K = 1**, and thus

   $P = \dfrac{1 + \sqrt{2A - 1}}{2}$

which is the formulation of Bernstein's (1968) coefficient
used in this paper.

*This is abstracted from Bernstein (1968). The complete derivation may
be obtained from the previous reference.
**K chosen equal to 0 gives slightly different results.
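
The algebra in Table 2 can be checked numerically. The sketch below assumes the quadratic A = P^2 + (1 - P)^2 K with Px = Py = P, takes the larger root, and confirms that it reproduces A and that K = 1 and K = 0 give (1 + sqrt(2A - 1)) / 2 and sqrt(A) respectively. Function names and the trial value A = .80 are illustrative assumptions.

```python
import math

def agreement(p, k):
    """A = P^2 + (1 - P)^2 * K, assuming Px = Py = P."""
    return p * p + (1 - p) ** 2 * k

def larger_root(a, k):
    """Larger root of (1 + K) P^2 - 2 K P + (K - A) = 0."""
    return (k + math.sqrt(a * (1 + k) - k)) / (1 + k)

a = 0.80  # trial proportion of agreement (illustrative value)

# Substituting each root back into the model should return A.
residuals = [abs(agreement(larger_root(a, k), k) - a) for k in (0.0, 0.5, 1.0)]

p_k1 = larger_root(a, 1.0)                    # K = 1
p_bernstein = (1 + math.sqrt(2 * a - 1)) / 2  # closed form for K = 1
p_k0 = larger_root(a, 0.0)                    # K = 0 reduces to sqrt(A)
```

At this trial value the K = 0 result is slightly larger than the K = 1 result, which is in line with the second footnote to the table.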



TABLE 3
Value of Bernstein's Pb*

   Percent Agreement      Pb

        100%             1.00
         95%              .97
         90%              .95
         85%              .92
         80%              .89
         75%              .85
         70%              .82
         65%              .77
         60%              .72
         55%              .65
         51%              .57

*Assumes Percent of Agreement is greater
than or equal to .51



The error term (Pe) for pi increases as the number of categories used during
any one observation period decreases. Those categories used most often get
a disproportionately higher weighting in the error term because of the nature
of squaring decimals, i.e., $.10^2 = .01$ while $.40^2 = .16$. The possible values
for pi are presented in Table 4 for intervals of .05 for both P and Pe.
Values of Pe are located on the top horizontal margin of the matrix, and values
of P are located on the left vertical margin of the matrix. A generally
accepted lower limit for accuracy is approximately .70. The heavy line in
Table 4 indicates that portion of the matrix which contains positive values
of pi greater than or equal to .70. The least value of P which provides a
value of pi greater than .70 is P = .75. And in this case, Pe equals .15 or less.
For example, assuming one has 10 categories in his coding system, no particular
category could be employed 40% of the time and few categories could be
employed 20% of the time, the remainder being employed 10% or less, if one
wanted to obtain a pi equal to .70 or greater. This additionally assumes
that the percent of agreement is .75 or greater.
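
The effect of category usage on pi can be illustrated with a short Python sketch. It assumes pi = (P - Pe) / (1 - Pe) with Pe taken as the sum of squared category proportions; the function name and both usage patterns are made-up examples, not data from the study.

```python
def scotts_pi(p, category_proportions):
    """Return (pi, Pe) for a proportion of agreement p.

    Pe is the sum of the squared proportions with which each
    category was used.
    """
    p_e = sum(q * q for q in category_proportions)
    return (p - p_e) / (1 - p_e), p_e

# Made-up pattern 1: ten categories, each used 10% of the time.
pi_even, pe_even = scotts_pi(0.75, [0.10] * 10)

# Made-up pattern 2: one category used 40% of the time.
skewed = [0.40, 0.20, 0.10, 0.10, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01]
pi_skew, pe_skew = scotts_pi(0.75, skewed)
```

With the same .75 agreement, the evenly used system gives a pi above .70, while the skewed system, whose error term is more than twice as large, falls below it.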

Insert Table 4 about here 

Comparison of Indices

The contingency coefficient was not used for the comparisons. Garrett
(1967, p. 258) states that the expected value of entries in the cells of the
contingency table should be five or greater. In this case, using the
observational instrument, one effectively is dealing with a 26 x 26 contingency table
(26 total categories of the observational instrument used in this case with
one dimension for each of two observers, each observation being
composed of approximately twenty recordings). Thus, one would
appropriately place twenty recordings in a 26 x 26 matrix. Collapsing levels
of the classification is not possible as it would be extremely difficult,
at best, to interpret the results of such a procedure. Additionally, one
would need to collapse to a 2 x 2 table to obtain an expected value of
five entries per cell, i.e., four cells with five entries each = 20. It
is not reasonably possible then to obtain an expected entry of five per
cell and thus not possible to obtain an estimate of the relationship of
inter-observer agreement.
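
The arithmetic behind this argument can be made explicit with a small sketch. It assumes the simplest case, in which the recordings are spread evenly over all cells; the function name is hypothetical.

```python
def mean_expected_per_cell(n_recordings, n_categories):
    """Average expected count per cell of an n x n contingency table."""
    return n_recordings / (n_categories * n_categories)

full = mean_expected_per_cell(20, 26)      # 20 / 676, roughly 0.03
collapsed = mean_expected_per_cell(20, 2)  # 20 / 4 = 5.0
```

Twenty recordings spread over 676 cells average far fewer than one entry per cell, and only the 2 x 2 table reaches the expected value of five.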

The data used for this comparison were obtained by employing Systematic
Who-to-Whom Analysis Notation (Swan, 1971), which is an observational
instrument based on the overt behavioral components of the representative
objectives of Developmental Therapy (Wood, 1972), a treatment approach for
emotionally disturbed children. The instrument is composed of eight major
and sixteen minor categories (a total of 24 categories) based on various
subsets of the Developmental Therapy objectives. The basic outline of the
system is shown in Table 5. One category is recorded every three seconds
in the appropriate who-to-whom column of the who-to-whom observation sheet
and each observation period is approximately one minute in length.



Insert Table 5 about here 



TABLE 5

Systematic Who-to-Whom Analysis Notation
(SWAN)

1. OBSERVERS                                        O
      In response to child's name being called      [illegible]
      Observes one who is talking                   OT

2. PHYSICAL CONTACT                                 C
      Inappropriate                                 C-
      Receives                                      CR

3. FOLLOWS DIRECTIONS                               F
      Does not follow directions                    F-

4. WORKS                                            W
      Works, but not appropriately sitting          [illegible]

5. VERBALIZES                                       V
      Inappropriate                                 [illegible]
      Non-understandable                            [illegible]
      I-statement                                   [illegible]
      Group rules                                   VG
      In response                                   VR

6. PHYSICAL ACTIVITY                                A
      Inappropriate                                 A-
      Parallel play                                 P+
      Play                                          P

7. RESPONDING ACTIVITY                              RA

8. NON-DIRECTED ACTIVITY                            N
      Removal from view                             [illegible]
      Removal from view by teacher                  [illegible]

Four sets of observational data were obtained for the comparison of
the indices. Each set is from a SWAN criterion training session (composed
of video-tapes). For each set of tapes there are three coefficients
(P, Pb, and pi), each computed for the two sets of criteria for observer
agreement, E and C (see Table 1), and each computed for each pair of observers.
Thus, there are six coefficients, each with a median and a range, as shown in
Tables 6, 7, 8, and 9.
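
How three observers and seven tapes yield 21 estimates per coefficient and condition, and how each set of estimates is summarized, can be sketched as follows. The observer labels and the coefficient values are made up for illustration; only the counts (three observers, seven tapes) come from the table footnotes.

```python
from itertools import combinations
from statistics import median

observers = ["A", "B", "C"]   # hypothetical observer labels
tapes = range(1, 8)           # seven tapes

# Every pair of observers is compared on every tape.
pairs = list(combinations(observers, 2))
estimates = [(pair, tape) for pair in pairs for tape in tapes]

def summarize(values):
    """Return (minimum, maximum, median) of a list of coefficients."""
    return min(values), max(values), median(values)

# Made-up coefficient values, only to show the summary step.
low, high, mid = summarize([0.55, 0.80, 0.95, 0.75, 0.85])
```

Three observers form three pairs, and three pairs over seven tapes give the 21 estimates on which each range and median is based.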



Insert Tables 6, 7, 8, and 9 here 



Discussion 

The medians of all three coefficients for all four sets of data for
the C condition are slightly higher than or equal to those for the E
condition. This is expected as the more stringent the criteria for agreement
the less the number of agreements. The ranges of all three coefficients
for all four sets of data for the C condition are slightly smaller than or
equal to those for the E condition and this is expected as per the same
rationale.

For both conditions, and all four sets of data (except for one case),
Bernstein's coefficient has the highest median and the smallest range; and
pi has the lowest median and the largest range. The exception to this
statement occurs in Table 9 where the ranges of Bernstein's coefficient and pi
are identical. This occurs because percent of agreement for one case of
inter-observer agreement was less than 51% and this results in Pb equal to
.00. The relationship for the medians holds for this case. Slight
variations exist between the sets of data with respect to the size of
the differences between the medians and the ranges.
TABLE 6
Data Set I
Inter-Observer Reliability Coefficients*

             Percent Agreement      Bernstein's Pb         Scott's pi
               E         C           E          C           E          C

   Range     55-95     65-95      .67-.97    .77-.97     .32-.93    .44-.93
   Median     80        85          .89        .92         .66        .71

*Based on three observers reviewing seven tapes producing 21
estimates of each coefficient for each condition.



TABLE 7
Data Set II
Inter-Observer Accuracy Coefficients*

             Percent Agreement      Bernstein's Pb         Scott's pi
               E         C           E          C           E          C

   Range     55-95     60-95      .57-.97    .72-.97     .00-.91    .10-.91
   Median     75        76          .85        .86         .43        .48

*Based on three observers reviewing seven tapes producing 21
estimates of each coefficient for each condition.

TABLE 8
Data Set III
Inter-Observer Accuracy Coefficients*

             Percent Agreement      Bernstein's Pb         Scott's pi
               E         C           E          C           E          C

   Range     71-100    75-100     .83-1.00   .85-1.00    .31-1.00   .54-1.00
   Median     89        89          .94        .95         .76        .82





TABLE 9
Data Set IV
Inter-Observer Accuracy Coefficients*

             Percent Agreement      Bernstein's Pb         Scott's pi
               E         C           E          C           E          C

   Range     40-100    50-100     .00-1.00   .00-1.00    .00-1.00   .00-1.00
   Median     85        85          .92        .92         .59        .63

*Based on three observers reviewing seven tapes producing 21
estimates of each coefficient for each condition.
Conclusions

The educational importance of this study concerns the practical applica- 
tion of these indices. If the assumptions are satisfied for a particular 
coefficient, the user must be aware of the nature of the coefficient and 
its behavior in order to interpret his results for the reader. It is 
particularly in the area of inter-observer accuracy that the user is often 
simply looking for some index, and the results are often presented without 
being interpreted for the reader. It is the writer's responsibility to
interpret these values to his readers, either in terms of significance levels
or in terms of the functioning of the index. 

One would for example interpret a resulting inter-observer figure of
.75 differently depending on whether it is a Bernstein's (1968) coefficient
or a Scott's (1955) pi. If it is a Bernstein's (1968) coefficient, the .75
is not extremely large, while if it is a Scott's pi, the .75 is very large.
If the sample size of recordings is large enough, and the .75 represents a
calculated contingency coefficient, one would need to compare such to Cmax.
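
One way to see why the same figure of .75 reads so differently is to ask how much raw agreement each index requires to reach it. The sketch below inverts the two formulas, assuming Pb = (1 + sqrt(2A - 1)) / 2 and pi = (P - Pe) / (1 - Pe); the function names and the error term of .15 are illustrative assumptions.

```python
def agreement_for_pb(pb):
    """Proportion of agreement A that yields Bernstein's coefficient pb.

    Valid for pb of .5 or more, where the square root is non-negative.
    """
    return ((2 * pb - 1) ** 2 + 1) / 2

def agreement_for_pi(pi, p_e):
    """Proportion of agreement P that yields Scott's pi for a given Pe."""
    return pi * (1 - p_e) + p_e

a_needed = agreement_for_pb(0.75)        # 0.625
p_needed = agreement_for_pi(0.75, 0.15)  # about 0.79, with Pe assumed .15
```

Under these assumptions a Bernstein's coefficient of .75 corresponds to only about 62% raw agreement, whereas a pi of .75 requires close to 79%.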

Thus, those individuals who use a specific index should be aware of 
the variability of the specific index used and the functioning of that in- 
dex and its range in order to enable them to interpret more clearly the de- 
gree of inter-observer accuracy with respect to the constraints implied by 
the index. 


